MICROCOPY  RESOLUTION  TEST  CHART 

NATIONAL  BUREAU  OF  STANDARDS- 1%3  A 


DEPARTMENT 

of 

COMPUTER  SCIENCE 


DTIC 

SELECT  E 
JUN2  3IS83 

A 


{or,  Cray  Parformance  from  a  VAX) 


Andrew  W.  Appel 
March  1983 


"7  DOCUMENTATION  PAGE 


CMU-CS-83-1 18  AJ> 


A.  TITLE  (on*  Sutintt) 

AN  EFFICIENT  PROGRAM  FOR  MANY-BODY  SIMULATIONS 
(RO,  CRAY  PERFORMANCE  FROM  A  VAX) 


I.C/.D  INSTRUCTIONS 
BEFGRt  COVPLEUNG  FORM 


W  fct'MBCIi 


i.  TYPE  OF  REPORT  *  PERIOD  COVERED 


Interim 


*.  PERFORMING  org.  report  number 


7.  *UTrOR|>; 


*.  CONTRACT  OR  GRANT  NUMBERfiJ 


Andrew  W.  Appel 


N00014-76-C-0370 


».  PERFORMING  ORGANIZATION  NAME  AND  ADDRESS 

Carnegie-Mellon  University 
Computer  Science  Department 
Pittsburgh,  PA  15213 


H.  CONTROLLING  OFFICE  NAME  AND  ADDRESS 

Office  of  Naval  Research 
Arlington,  VA  22217 


PROGRAM  ELEMENT.  PROJECT.  TASK 
AREA  «  PORK  UNIT  NUMBERS 


12.  REPORT  DATE 


March  1963 


13.  NUMBER  OF  PAGES 


S A.  MONITORING  AGENCY  name  4  ADORESSfU  dltttront  trotr-  Conttolltng  Otliem)  IS.  SECURITY  CLASS,  (o!  thtm  roport) 

UNCLASSIFIED 

IS*.  OECLASSI  FI  CATION/ DOWN  GRADING 
SCHEDULE 


<6.  DISTRIBUTION  STATEMENT  (oltfilt  Report) 


Approved  for  public  release;  distribution  unlimited 


17.  DJ5T  ft)  BUT  )DK  STATEMENT  (d  Ihr  obatroct  ont «  tJ  in  Block  3b,  it  dUlmrtnt  it  on  Report) 


IS.  KEY  WORDS  (Ct-ntlnoo  on  r«v«r*i  aid*  il  o •  camamy  mnd  idmnttty  by  block  nw»dar,) 


JO.  ABSTRACT  (Ccntlnur  on  «/ct  1i  ritUMMy  mid  ioantlty  by  biot>  mjsnbmt} 


line  simulation  of  A'  particles  interacting  in  a  gravitational 
simulations  become  cosily  for  large  S.  Representing  the  universe 


/iiational  force  field  is  useful  in  astrophysics,  but  such 


as  a  tree  structure  with  the  particles  at  the 


leaves  and  internal  nodes  labelled  with  the  centers 


of  mass  of  their  descendants  allows  several  simultaneous 


attacks  on  the  computation  time 


united  by  the  problem.  There  approaches  ranee  from  algorithmic  changes 


tfcillCM  Or  '  NCiVf'vl!  CL'iiwtTt 

r/h  <  it'?* c  »•*  f f ci 


\Kcv.i  nr: n 

% r  c i  *Wv  fZ  •' i tV*  i cT*  irv  c»  t  •- r  ^ 


SECURITY  CLASSIFICATION  OF  THIS  FACE  {*>•••  D mm  Lntmr~4 


(replacing  an  0(A»)  algorithm  with  an  0(N  log  AO  algorithm)  to  data  structure  modifications,  code-tuiting, 
and  hardware  modifications.  The  changes  reduced  the  running  time  of  a  large  problem  (A'=  10,000)  by  a  j 
factor  of  four  hundred.  This  paper  describes  both  the  particular  program  and  the  methodology  underlying 

such  spccdups. 


S.N  0103-  Lf-  014-  AftOI 


UNCLASSIFIED 


t  T  i  A  mu  A  Tull  O  A  Ft  M 


CrtU-CS-83-118 


An  Efficient  Program 

for  Many-Body  Simulations 

(or,  Cray  Performance  from  a  VAX) 

Andrew  W.  Appel 
March  1983 


Abstract 

^Vhe  simulation  of  ,V  particles  interacting  in  a  gravitational  force  field  is  useful  in  astrophysics,  but  such 
simulations  become  costly  for  large  N.  Representing  the  universe  as  a  tree  structure  with  the  particles  at  the 
leaves  and  internal  nodes  labelled  with  the  centers  of  mass  of  their  descendants  allows  several  simultaneous 
attacks  on  the  computation  time  required  by  the  problem.  These  approaches  range  from  algorithmic  changes 
(replacing  an  Of/V2)  algorithm  with  an  0(N  log  N)  algorithm)  to  data  structure  modifications,  code-tuning, 
and  hardware  modifications.  'Hie  changes  reduced  the  running  time  of  a  large  problem  (;V=  10,000)  by  a 
factor  of  four  hundred.  This  paper  describes  both  the  particular  program  and  the  methodology  underlying 
such  spccdups^ 
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1 .  Introduction 

Isaac  Newton  calculated  the  behavior  of  two  particles  interacting  through  the  force  of  gravity.  but  he  was 
unable  to  solve  the  equations  for  three  particles.  In  this  he  was  not  alone  [6,  p.  634],  and  systems  of  three  or 
more  particles  can  be  solved  only  numerically.  Itetative  methods  arc  usually  used,  computing  at  each  discrete 
time  interval  the  force  on  each  particle,  and  then  computing  die  new  velocities  and  positions  for  each  particle. 

A  naive  implementation  of  an  iterative  many-body  simulator  is  computationally  very  expensive  for  large 
numbers  of  particles,  where  "expensive"  means  days  of  Cray-1  time  or  a  year  of  VAX  time.  This  paper 
describes  the  development  of  an  efficient  program  in  which  several  aspects  of  the  computation  were  made 
faster.  The  initial  step  was  the  use  of  a  new  algorithm  with  lower  asymptotic  time  complexity;  the  use  of  a 
better  algorithm  is  often  the  way  to  achieve  die  greatest  gains  in  speed  [2]. 

Since  every  particle  attracts  each  of  the  others  by  die  force  of  grav  ity,  there  arc  O(.V-)  interactions  to 
compute  for  every  iteration.  Furthermore,  for  the  same  reasons  that  die  closed  form  integral  diverges  for 
small  distances  (since  the  force  is  proportional  to  the  inverse  square  of  the  distance  between  two  bodies),  the 
discrete  time  interval  must  be  made  extremely  small  in  the  ease  that  two  particles  pass  very  close  to  each 
other.  These  are  the  two  problems  on  which  the  algorithmic  attack  concentrated.  By  die  use  of  an 
appropriate  data  structure,  each  iteration  can  be  done  in  0(/\  log  ,V)  time,  and  the  time  intervals  may  be 
made  much  larger,  thus  reducing  the  number  of  iterations  required.  The  algoridim  is  applicable  to  yV- body 
problems  in  any  force  field  with  no  dipole  moments. 

Using  an  algoridim  with  a  better  asymptotic  time  complexity  yielded  a  significant  improvement  in  running 
time.  Four  additional  attacks  on  the  problem  were  also  undertaken,  each  of  which  yielded  at  least  a  factor  of 
two  improvement  in  speed.  These  attacks  ranged  from  insights  into  the  physics  down  to  hand-coding  a 
routine  in  assembly  language.  By  finding  savings  at  many  design  levels,  the  execution  time  of  a  large 
simulation  was  reduced  from  (an  estimated)  8000  hours  to  20  (actual)  hours.  The  program  was  used  to 
investigate  open  problems  in  cosmology,  giving  evidence  to  support  a  model  of  the  universe  with  random 
initial  mass  distribution  and  high  mass  density. 

This  paper  describes  the  problem  and  its  solution,  considered  from  the  point  of  view  of  a  computer  scientist 
approaching  a  software  engineering  problem.  Thus,  only  a  brief  overview  of  the  physics  is  given;  the 
emphasis  is  on  techniques  of  writing  efficient  software.  Section  2  explains  the  nature  of  the  cosmological 
questions  that  can  be  answered  by  many-body  simulations.  Section  3  describes  some  old  algorithms  for  such 
simulations,  Section  4  introduces  the  data  structure  and  the  algorithm  to  reduce  the  time  per  iteration,  and 
Section  5  shows  how  to  use  the  data  structure  to  reduce  the  number  of  iterations.  Section  6  shows  how  to 
create  the  structure  and  how  to  keep  it  from  becoming  distorted.  Section  7  describes  an  implementation  of 
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the  algorillim.  I  ho  techniques  used  to  attain  spccdups  at  various  design  levels  are  described.  These  speedups 
are  summarized,  and  die  design  methodology  leading  to  them  is  discussed,  in  Section  8. 


2.  Applications  in  Astrophysics 

The  search  for  a  faster  algorithm  to  compute  many-body  interactions  in  a  gravitational  force  field  was 
motivated  by  two  important  questions  in  cosmology  that  can  be  investigated  by  simulating  gravitational 
interactions  of  tens  of  thousands  of  galaxies.  An  efficient  computer  program  has  made  it  feasible  to  do  such 
simulations.  This  section  describes  the  cosmological  applications,  and  die  remaining  sections  describe  the 
program. 

2.1 .  How  Did  Galaxies  Form? 

It  is  generally  believed  that  the  early  universe  was  radiation-dominated,  that  is.  that  most  of  the  energy  of 
the  universe  was  in  the  form  of  photons,  and  the  forces  on  a  typical  particle  were  primarily  electromagnetic. 
The  present  uniscrsc.  however,  is  mass-dominated,  with  most  of  the  energy  condensed  into  massive  bodies 
(such  as  stars),  and  the  primary  interaction  between  these  bodies  being  gravitational  (the  gravitational  force 
between  the  Farth  and  die  Sun,  for  example,  completely  dominates  the  "solar  w-ind"  of  photons  pushing  the 
Fai  th  away  from  the  Sun). 

Ihe  transition  between  a  radiation-dominated  and  a  mass-dominated  universe  probably  took  place 
relatively  suddenly;  after  that,  massive  bodies  such  as  galaxies  began  to  form  (they  would  have  been  torn 
apart  in  a  radiation-dominated  universe).  Two  of  the  competing  theories  describing  die  formation  of 
galaxies  (20]  may  be  characterized  as  "top-down"  and  "bottom-up,"  respectively. 

In  the  "top-down"  theory  [21],  galaxy  clusters  formed  as  a  result  of  long-range  pressure  waves  left  over 
from  the  radiation-dominated  universe.  A  pressure  wave  contains  alternating  regions  of  high  and  low  density. 
When  the  universe  "condensed"  and  the  radiation  disappeared,  there  would  be  no  medium  to  support  the 
waves,  but  the  regions  of  high  and  low  mass-density  would  remain.  It  is  proposed  that  the  regions  of  high 
density  became  super-clusters  of  galaxies;  that  galaxies  formed  within  these  super-clusters;  and  that  stars 
formed  within  the  galaxies.  Two-dimensional  simulations  under  these  assumptions  have  shown  a  cell-like 
structuring  of  the  clusters  (7);  it  is  not  clear  whether  the  dimensionality  of  the  simulation  is  responsible.  It 
may  be  that  these  cells  exist  in  the  present  universe  [12],  but  the  observations  at  large  distances  arc  not 
conclusive. 

In  the  ”bottom*up"  theory  [16],  there  were  no  pressure  waves,  and  the  universe  immediately  after 
condensation  consisted  of  randomly  distributed  hydrogen  molecules.  In  a  random  distribution,  there  will  be 
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local  fluctuations  in  mass  density,  and  as  the  universe  expands,  the  denser  regions  will  tend  to  cohere,  while 
die  regions  oflowcr  density  will  expand.  This  will  tend  to  increase  the  size  of  the  fluctuations,  funning  stars. 
More  expansion  will  increase  the  size  of  the  fluctuations  to  that  of  galaxies  and  eventually  of  clusters  and 
super-clusters  of  galaxies.  The  clusters  will  have  a  more  random  structure  than  in  tire  "top-down"  model. 

In  both  theories,  die  only  significant  interactions  between  galaxies  after  die  condensation  are  gravitational. 
A  simulation  of  the  motion  of  many  particles  with  gravitational  interactions  can  therefore  test  these  theories. 
A  tcn-thousand-galaxy,  three-dimensional  simulation  testing  the  "bottom-up"  theory  (that  is.  starting  with  a 
uniform  random  distribution  of  particle  positions)  has  been  done  using  the  techniques  described  in  the 
remainder  of  diis  paper.  The  result  of  the  simulation  is  clustering  consistent  with  diat  observed  by 
astronomers  (see  Figure  2-1  for  a  picture  of  the  simulations  output).  A  similar  test  of  the  "top-down"  theory 
has  not  yet  been  done,  but  since  this  theory  differs  from  the  "bottom-up"  theory  primarily  in  its  specification 
of  the  distribution  of  the  initial  placement  of  the  particles,  it  could  be  simulated  easily  using  the  same 
algoridrm. 

The  large-scale  simulations  done  using  die  program  described  in  Sections  3  dirough  6  of  this  paper  seem  to 
imply  diat  the  bottom-up  model  can  explain  the  present  mass  distribution  of  the  universe  quite  well,  without 
the  complicated  assumptions  inherent  in  the  top-down  model. 

2.2.  Is  the  Universe  Open  or  Closed? 

One  of  the  fundamental  questions  in  cosmology  is  whether  die  universe  will  continue  expanding  forever,  or 
whether  it  will  eventually  collapse  in  a  gigantic  reversal  of  the  Big  Bang.  One  way  to  answer  this  question  is  to 
look  at  the  mass  density  of  the  universe.  If  the  universe  is  below  a  certain  "critical  density"  then  expansion 
will  continue  forever;  otherwise  it  will  contract.  Unfortunately,  it  is  difficult  to  measure  die  mass  density  of 
the  universe.  Astrophysicists  have  been  able  to  make  estimates;  most  observational  estimates  put  the  mass 
density  at  about  a  tenth  of  the  critical  density.  Since  Truth,  Beauty,  and  Simplicity  demand  that  die  density  of 
the  universe  be  equal  to  the  critical  density  [15],  a  great  astronomical  search  has  been  on  for  die  "missing 
mass."  The  search  is  complicated  by  the  fact  that  many  forms  of  mass  (such  as  black  holes)  arc  difficult  to 
observe  directly. 

This  problem  can  be  avoided  by  approaches  that  do  not  involve  direct  observation  of  the  mass  density. 
One  such  approach  is  dirough  simulation  of  the  gravitational  intcracdons  of  galaxies  under  different 
assumptions  about  the  mass  density.  Groth  et  al.  [10]  observed  in  small  simulations  that  low  mass  densities 
will  not  lead  to  the  amount  of  clustering  actually  observed,  and  that  the  critical  density  would  lead  to  such 
clustering.  The  ten-thousand-body,  three-dimensional  simulation  using  the  program  described  later  in  this 


4 


ANDK1-W  W.M'I'M 


13  Al’KIl.  1983 


Figure  2-1:  Result  of  a  Simulation 

An  initial  randomly  generated  configuration  of  10.000  galaxies,  and  the  result  of  simulating  the  gravitational  interactions  of  this 
configuration  as  the  universe  expands  by  a  factor  of  7.12,  with  mass  density  p  -  p  crn  as  a  parameter  of  die  simulation. 

The  particles  arc  in  a  three-dimensional  space  which  has  been  projected  into  two  dimensions  for  this  picture.  A  periodic  coordinate 
system  is  used  in  which  the  two  extreme  points  in  each  dimension  are  identified.  The  pictures  are  scaled  to  the  expansion  factor  of  the 
simulated  universe. 

paper  was  for  the  higher-density  ease;  large-scale  clustering  was  observed,  lending  support  to  this  theory.  The 
lower-density  case  can  be  examined  by  the  same  techniques. 

3.  Previous  Algorithms 

Because  the  N-body  problem  cannot  be  done  in  closed  form,  the  calculation  must  be  done  numerically. 
That  is,  at  each  time  /,  the  gravitational  forces  of  each  mass  on  each  of  the  others  may  be  computed  by 
Newton’s  laws.  (For  an  appropriate  range  of  distances  --  say,  between  one  and  a  few  hundred  million 
light-years  --  Newton’s  laws  are  a  good  approximation  to  General  Relativity.)  Using  the  inverse-square  force 
law,  an  approximation  to  the  true  acceleration  and  velocity  of  each  particle  over  a  time  di  can  be  computed. 
By  many  iterations  of  this  method,  the  position  of  each  particle  after  an  arbitrary  length  of  time  may  be 
found. 
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3.1.  A  Simple  Algorithm 

Newton’s  law  of  grav  ity  states  that  the  force  between  any  pair  of  particles  is  proportional  to  the  product  of 
their  masses  divided  by  the  square  of  die  distance  between  them.  Stated  as  a  vector  equation. 

where  r,  is  the  position  vector  of  particle  K  r",  is  the  acceleration  vector  of  particle  /.  and  (7  is  the  universal 
grav  itaiional  constant. 

When  there  arc  many  particles,  the  acceleration  of  each  particle  is  given  by  the  sum  of  the  accelerations  (as 
computed  by  Newton's  law)  for  all  the  other  particles.  This  is  simply  a  large  set  of  differential  equations.  For 
two  bodies,  it  is  solvable  in  closed  form;  however,  for  more  than  two  bodies  no  closed  form  solution  exists. 

Hie  differential  equation  can  be  integrated  numerically  using  a  "naive"  algorithm.  At  each  iteration, 
compute  the  acceleration  acting  upon  each  particle:  from  this,  compute  a  modified  velocity  over  the  next  time 
increment,  and  then  compute  the  position  of  each  particle  at  the  end  of  die  time  increment  by  calculating 

rncw  =  rold 

The  time  increment  di  must  be  made  small  enough  diat  die  accelerations  do  not  greatly  change  between  t  and 
t  +  dt. 

There  arc  two  problems  with  this  algorithm.  The  first  is  that  die  number  of  interactions  is  large  as  a 
function  of  the  number  of  particles.  In  particular,  die  gravitational  action  of  each  particle  on  every  other 
DSiticlc  must  be  computed  every  iteration,  requiring  a  total  of  W-N  operations.  When  /V  is  large  (physicists 
would  like  to  simulate  tens  of  thousands  of  particles,  although  they  arc  rarely  able  to  do  so),  an  O^N1) 
algorithm  is  extremely  costly  to  execute. 

The  second  problem  in  many-body  simulations  is  Uiat  it  usually  happens  that  some  pairs  of  particles  in 
such  a  system  will  pass  very  close  to  each  other.  Nearby  particles  in  a  gravitational  field  usually  move  at  high 
speed  with  respect  to  each  other;  the  combination  of  high  velocities  and  small  distances  necessitates  an 
extremely  small  time  increment  between  iterations. 

One  approach  to  these  problems  is  to  use  an  extremely  fast  computer.  The  Cray*l  computer  is  very  fast  at 
algorithms  that  have  a  "vectorizablc"  formulation;  that  is,  problems  which  can  be  expressed  in  terms  of 
clemcnt-by-element  arithmetic  operations  on  long  arrays  of  numbers.  The  acceleration  computation  can  be 
formulated  in  terms  of  such  large  vectors.  If  the  vector  instructions  of  the  Cray-1  arc  used  to  advantage 
(either  by  hand-coding,  or  by  using  the  Cray  Fortran  compiler  with  a  good  understanding  of  what  sorts  of 
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programs  the  compiler  can  generate  efficient  code  for),  the  time  required  to  calculate  die  acceleration 
between  two  bodies  can  be  estimated  at  100  clock  cycles  (40  of  which  arc  needed  for  a  calculation  of  a 
periodic  distance  function  peculiar  to  the  many-galaxy  problem  [3j).  The  time  for  one  clock  cycle  is  12 
nanoseconds  [18|,  and  the  number  of  pairs  of  bodies  is  ,W2.  so  the  time  for  one  iteration  can  be  estimated  at 
.6/V2  microseconds.  Using  scalar  instructions,  or  using  vector  instructions  with  inefficient  pipeline  behavior, 
would  more  than  double  the  time  taken  per  iteration. 

Using  a  similar  program  to  simulate  ten  thousand  bodies  over  one  thousand  iterations  requires 
approximately  8000  hours  of  VAX  time  (this  was  extrapolated  from  observations  of  100-partidc  simulations). 
Table  3-1  gives  the  times  required  for  various  implementations  of  a  straightforward  simulator,  Kven  on  a  fast 
vector  processor  like  the  Cray-1,  this  simulation  takes  several  hours,  line  disadvantage  to  running  the 
simulation  on  the  Cray  computer  is  that  the  Cray-1  is  enormously  expensive:  at  a  cost  of  eight  to  ten  million 
dollars  it  is  about  40  times  as  expensive  as  a  large  minicomputer  such  as  a  VAX.  A  solution  whereby  the 
problem  can  be  solved  in  tens  of  hours  on  the  VAX  would  obviously  be  preferable  to  any  of  the  points  in  the 
solution  space  described  in  the  table  below. 

VAX-11/780  Cray- 1  (estimated) 

Optimizing  Compiler  8000  30 

Hand-optimized  5000  16 

Table  3-1:  Running  times,  in  hours,  of  an  Oi.iV2 )  program 
for  10,000  bodies  over  1,000  iterations. 

3.2.  Other  Algorithms  in  the  Literature 

Two  approaches  have  been  taken  to  reduce  the  computational  cost  of  solving  the  N-body  problem.  One 
approach  is  to  represent  the  problem  in  a  position- velocity. phase  space,  and  transform  the  force  field  using  a 
Fast  Fourier  Transform  into  a  form  where  it  can  be  applied  in  linear  time  [13, 14],  This  takes  0(N  log  N) 
time  (dominated  by  the  Fourier  Transform)  per  iteration.  However,  the  phase  space  must  be  discrete.  This 
means  that  all  positions  must  be  multiples  of  some  lattice  size  a,  and  that  all  velocities  must  be  less  than  some 
maximum  f.  Thus,  the  (physically  interesting)  effects  of  tight  dusters  cannot  be  modelled. 

Another  approach  is  to  keep  track,  for  each  particle,  of  the  sets  of  "nearby"  particles  and  "faraway" 
particles  [1].  The  "faraway"  particles  may  be  integrated  with  larger  time-steps  than  the  "nearby"  particles. 
When  the  particles  are  uniformly  distributed,  this  has  an  asymptotic  complexity  of  0(Af 5).  Unfortunately, 
when  clustering  occurs,  the  number  of  "nearby"  particles  is  in  the  same  order  of  magnitude  as  the  total 
number  of  ps»  -cles,  and  ‘  asymptotic  complexity  is  again  OiN1).  The  problem  of  small  time-steps  is 
attacked  by  usir  c  special-case  technique  for  close  two-body  interactions,  but  this  technique  cannot  be 
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applied  for  light  clusters  of  three  or  more  particles. 

Another  similar  approach  is  to  divide  the  universe  into  cells,  computing  the  particle-particle  interactions 
within  die  cell,  and  then  the  cell-cell  interactions  [1 1],  This  also  has  complexity  0(\r‘ 5 )  for  a  uniform 
distribution,  and  also  degrades  to  a  quadratic  time-complexity  when  clustering  occurs. 

With  none  of  dicsc  algorithms  is  the  problem  of  the  vanishingly  small  discrete  time-step  solved:  in  the 
discrete  phase-space  approach,  the  time  steps  cannot  be  made  smaller  and  Uius  information  is  lost,  while  in 
the  second  and  third  approaches,  the  problem  is  essentially  die  same  as  with  die  "naive"  algoridim. 

4.  Reducing  the  Complexity  of  Each  Iteration 

To  compute  the  force  of  gravity  on  an  apple  exerted  by  the  Kardi.  it  suffices  to  treat  the  Harth  ns  a  point 
mass;  it  is  not  necessary  to  sum  die  forces  exerted  by  each  atom  of  the  Harth.  This  is  a  consequence  of  the 
spherical  symmetry  of  the  Kardi;  Newton  invented  the  integral  calculus  to  prove  this  fact. 

When  an  attracting  body  is  not  spherically  symmetric,  die  result  obtained  by  treating  it  is  a  point  mass  is  no  ' 
longer  exact,  but  it  is  a  good  approximation.  This  approximation  -  in  which  one  attraction  between  a  pair  of 
point  masses  is  calculated,  rather  than  all  the  attractions  between  all  their  constituent  particles  -  is  the  key  to 
reducing  the  asymptotic  complexity  of  computing  the  accelerations  from  0(ff )  to  0(,V  log  N). 

4.1 .  The  Monopole  Approximation 

A  dividc-and-conqucr  algorithm  can  solve  the  many-body  problem  in  0(.V  log  N)  time  per  iteration,  and 
requires  significantly  fewer  iterations.  This  order  time  has  not  been  proved,  but  a  reasonable  argument  is 
given;  furthermore,  experience  with  an  implementation  of  the  algorithm  has  shown  that  it  runs  as  quickly  as 
expected. 

The  algorithm  relies  upon  the  following  approximation:  suppose  there  arc  two  particles,  mx  and  m2,  each 
no  more  than  dr  from  their  center  of  mass  (sec  Figure  4-1).  The  gravitational  attraction  they  exert  upon  an 
observer  situated  a  distance  r  from  the  center  of  mass  will  be 

+  +  om. 

Because  there  is  no  term  in  dr  in  this  equation,  the  approximation  is  good  to  first  order. 

Now  consider  the  arrangement  of  masses  shown  in  Figure  4-2,  which  we  will  suppose  to  be  a  subset  of  the 
particles  in  a  many-body  simulation.  To  compute  the  acceleration  of  each  particle  on  every  other,  we  may 
break  the  computation  into  three  parts:  those  interactions  of  two  particles  which  arc  in  the  left-hand  clump, 
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those  interactions  of  which  both  particles  are  in  the  right-hand  clump,  and  die  interactions  of  a  particle  from 
each  clump.  The  latter  interactions  may  be  approximated  to  order  (dr/>f  by  using  the  approximation 
described  in  the  previous  paragraph:  by  computing  one  interaction,  as  if  each  of  the  two  clumps  were  one 
large  mass.  The  number  of  computations  required  to  calculate  the  inter-clump  interaction  has  thus  been 
reduced  from  n,  n2  to  1;  the  intra-clump  calculation  remains  unchanged. 


Had  the  two  clumps  been  closer  together,  then  the  approximation  would  no  longer  have  been  as  good, 
since  it  depends  on  the  value  of  dr/r.  In  that  ease,  more  calculations  would  have  had  to  be  done. 

4.2.  A  Data  Structure 

A  method  is  needed  for  finding  subsets  of  the  particles  for  which  the  approximation  can  be  made.  This  is 
made  easier  by  the  introduction  of  an  appropriate  data  structure  --  a  binary  tree  whose  leaves  arc  particles  and 
whose  internal  nodes  represent  clumps  of  particles.  Every  node  will  have  an  associated  mass  and  position. 
The  leaves  will  have  the  mass  and  position  of  the  particles  they  represent;  each  internal  node  will  have  a  mass 
equal  to  the  sum  of  the  masses  of  its  two  child  nodes,  and  a  position  equal  to  the  center  of  mass  of  its  child 
nodes.  Also  associated  with  each  clump  (internal  node)  will  be  the  approximate  radius  of  the  clump. 

It  is  now  a  simple  matter  to  compute  all  of  the  gravitational  interactions  between  two  clumps  that  arc  small 
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relative  tu  their  separation,  that  is, 

Jr,/KS  and  Jr:/r<8 

for  some  fixed  criterion  of  accuracy  The  parameters  Jr.  and  Jr,  are  stored  in  the  tree:  die  positions  need 
only  be  subtracted  and  multiplied  by  the  total  masses  of  each  clump  (also  stored  in  the  tree). 

If  the  accuracy  criterion  is  not  satisfied,  that  is.  if  the  clumps  arc  large  and  close  together,  then  the 
calculation  of  the  interaction  of  each  of  the  two  siibcluinps  of  one  clump  with  each  of  the  two  subclumps  of 
the  other  clump  must  be  made.  It  is  not  always  necessary  to  "break  up"  both  clumps  for  this  calculation:  see 
Figure  4-3  for  an  example  in  which  one  clump  satisfies  the  criterion  and  need  not  be  split,  while  the  other 
clump  is  split  into  two  pieces. 


Figure  4-3:  An  example  of  the  calculation  of  clump  interaction 


4.3.  The  Algorithm 

This  algorithm  can  be  coded  as  the  following  pair  of  pseudo-Pascal  recursive  procedures  --  procedure 
ComputcAcccl  computes  all  of  the  accelerations  internal  to  one  clump,  and  procedure  TwoNodc  computes 
the  interactions  between  two  clumps. 

procedure  ComputcAcccl  (  B ) 

begin  if  B  is  a  nontrivial  clump 

then  begin  CompuicAcccl(  Bicrt-child) 

Compute  Acccl(  flnght-child) 

TwoNode(  ^left-child1  ^right-child) 
end 
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procedure  TwoNode(/I.ZJ) 
begin  d  *•  —  r  4 

if  (drA/d>  8  )  and  (dr A>  dr  ft) 
then  begin  TwoNodc(  /l|Cft-Chiid-^) 

I  woN  ode(  A  ngin  thild' 

end 

else  if  drft/d>  8 

then  begin  TwoNodc(  /l./?icrt-chiid) 

TwoNodcf  A,  /^right-child) 

end 

else  begin  Acc^  *■  Acc^  +  GmgA/d 3 
Acc  b  *■  Acc  ft  -  GtnAA/di 

end 

end 

One  detail  that  for  clarity  has  so  far  been  omitted  from  the  description  of  the  algorithm  pertains  to  the 
representation  of  position,  velocity,  and  acceleration  vectors.  Rather  than  storing  at  each  node  the  absolute 
position  of  the  clump  associated  with  that  node,  the  position  vector  from  the  node's  parent  to  the  node  is 
stored.  (The  same  applies  to  velocities  and  accelerations.)  1'his  is  done  in  order  to  minimize  round-off  errors 
in  subtractions,  which  will  be  discussed  in  section  7.  The  absolute  position  of  a  particle  or  clump  may  be 
computed  by  taking  the  sum  of  the  position  offsets  of  all  its  ancestors  up  to  the  root,  though  it  is  rarely 
necessary  to  compute  absolute  positions.  Note  that  the  algorithm  assigns  accelerations  throughout  the  data 
structure,  taking  advantage  of  the  rclativization  of  acceleration  vectors. 

4.4.  Analysis  of  Time  Complexity 

If  the  parameter  8  is  set  to  zero,  then  the  TwoNode  procedure  will  always  recur  down  to  the  level  of 
individual  particles,  and  the  accelerations  assigned  to  the  internal  nodes  will  be  zero.  If  8  is  not  equal  to  zero, 
then  the  absolute  acceleration  of  a  single  particle  will  be  an  approximation  to  the  true  acceleration.  For  values 
of  5  between  0  and  1,  the  time  complexity  of  ComputcAccel  is  estimated  (and  observed)  to  be  0(N  log  N). 

To  see  this,  consider  the  number  of  times  a  particle  X  is  comDarcd  with  other  clumps  for  the  purposes  of 
adding  to  an  acceleration  vector.  Suppose  there  is  a  spherical  shell  around  X  of  radius  r  and  thickness  S  r.  If 
this  shell  is  filled  with  clumps  of  diameter  S  r,  then  there  will  be  4/61  clumps  in  the  shell.  The  smallest  sphere 
will  have  a  size  such  that  the  expected  number  of  galaxies  contained  within  it  is  1;  the  largest  will  enclose  a 
volume  such  that  the  expected  number  of  galaxies  within  it  is  N.  The  quotient  of  the  radii  of  the  largest  and 
smallest  spheres  will  therefore  be  A/1/J.  This  will  be  equal  to  (1  +  5)*,  where  k  is  the  number  of  shells.  Then 
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k  =  !og(.V)/3  log  (l  -F  5),  and  the  number  of  clumps  for  which  there  must  be  calculation  of  accelerations  with 
respect  to  particle  X  is  approximately 
4  log  ,V 

35'  log ( 1  +  5)  ' 

Note  that  this  number  overestimates  the  number  of  calculations  done,  in  that  some  of  the  calculation  will 
involve  not  the  comparison  of  X  with  another  clump,  but  the  comparison  of  an  enclosing  clump  of  .V  with 
another  clump.  That  calculation  would  also  be  counted  in  this  analysis  as  a  calculation  for  A's  sibling  clump, 
and  all  other  subclumps  of  the  encompassing  clump.  However,  this  will  do  no  more  titan  change  the  constant 
of  proportionality:  for  each  of  the  .V  galaxies.  0(  log  .V)  calculations  must  be  done,  giving  a  total  execution 
time  --  for  fixed  5  --  of  0(X  log.V). 

4.5.  Accuracy  of  the  Algorithm 

The  parameter  5  is  a  measure  of  the  accuracy  of  the  calculation.  When  one  clump  is  compared  with 
another,  and  the  ratio  of  diameter  to  separation  is  less  than  5,  then  die  computed  acceleration  will  have  a 
fractional  error  less  than  5:.  When  all  the  accelerations  that  clump  .V  feels  from  other  clumps  arc  summed, 
the  error  in  acceleration  should  be  proportional  to  5:  divided  by  the  square  root  of  the  number  of  clumps 
compared  with  (assuming  random  directions  of  the  error  vector).  A  more  intuitive  explanation  of  this 
statistical  argument  is  that  larger  clumps  will  tend  to  approach  some  sort  of  spherically  symmetric 
distribution,  simply  because  of  the  large  number  of  randomly  positioned  particles.  In  a  perfectly  spherical 
distribution,  the  error  made  in  assuming  that  all  die  mass  is  positioned  at  the  center  is  exactly  zero.  Thus  the 
error  in  acceleration,  on  the  average,  should  be  significantly  less  than  S2. 

In  fact,  the  distribution  of  errors,  shown  in  Figure  4-4.  is  such  that  there  is  a  maximum  absolute  error 
range,  such  that  for  most  particles  the  error  is  quite  small  on  an  absolute  scale.  For  particles  with  large 
accelerations,  the  proportional  error  is  practically  zero.  Figure  4-4  was  computed  by  taking  a  random 
distribution  of  particles  and  using  the  (exact)  results  computed  by  running  the  algorithm  with  5  =  0  as  the 
"Actual"  acceleration  components,  and  using  the  results  computed  with  5=0.3  as  the  "Computed" 
acceleration  components.  The  absolute  errors  arc  the  deviations  from  the  line  y- x,  the  scattcrplot  shows  a 
good  bound  on  the  absolute  error. 

In  those  calculations  where  the  exact  final  positions  of  the  particles  is  not  as  important  as  statistics  about 
their  configurations,  a  relatively  large  value  of  5  can  be  used  (such  as  greatly  reducing  the  constant  factor 
in  the  running  time  of  the  0(N  log  N)  program. 

It  is  useful  to  note  that  although  the  O0V2)  algorithm  has  theoretically  complete  accuracy  in  computing 
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accelerations,  die  fact  that  the  time  intervals  must  be  made  discrete  introduces  approximations  i.ito  any 
numerical  calculation  of  the  .V-body  problem.  15 y  choosing  the  parameters  so  that  the  errors  introduced  by 
each  part  (the  clump  approximation  and  the  discrete-time  approximation)  arc  equal,  die  resulting  error  is 
about  equal  to  that  of  the  standard  algoridim. 

Since  die  use  of  a  clumping  algorithm  to  study  the  formation  of  galaxy  clusters  might  conceivably  be  a 
cause  of  systematic  error,  die  result  of  a  simulation  using  this  algorithm  in  which  no  clustering  occurred  is  of 
interest.  In  this  simulation,  the  galaxies  were  given  higher  initial  velocities  than  predicted  by  theory,  and  no 
measurable  clustering  occurred  (as  seen  both  by  the  human  eye  and  by  a  correlation  function  of  intcrparticlc 
distance). 

5.  Reducing  the  Number  of  Iterations 

When  two  particles  come  very  close  to  each  other  in  an  inverse-square  force  field,  their  accelerations 
become  extremely  high.  To  model  dicir  behavior  accurately,  extremely  small  time  steps  arc  required.  In  any 
simulation  with  a  large  number  of  particles,  there  arc  bound  to  be  a  few  such  pairs  at  any  given  tin*  ,  these 
pairs  require  the  time  increments  of  die  simulation  to  be  so  small  that  the  number  of  iterations  required  to 
integrate  over  a  significant  interval  of  time  becomes  prohibitively  large. 

One  widely  used  solution  to  this  problem  modifies  the  force  law  to  limit  the  accelerations  at  small 
distances.  The  inherent  problem  with  diis  approach  in  the  modelling  of  galaxy  clustering  is  that  the  clustering 
occurs  (and  should  be  examined  by  the  simulation)  over  all  distance  scales.  To  tamper  with  the  force  lav/  at 
small  distances  makes  any  conclusions  about  clustering  at  these  distances  suspect. 

Fortunately,  the  data  structure  introduced  in  the  previous  section  leads  to  a  solution  to  this  problem  that 
preserves  the  inverse-square  properties  of  the  force  law  at  all  distance  scales.  In  section  5.1  an  aspect  of  the 
calculation  open  to  algorithmic  attack  is  described,  and  the  attack  itself  is  explained  in  sections  5.2  and  5.3. 


5.1 .  Characteristic  Times 

The  time  increment  dt  between  iterations  is  determined  after  each  iteration.  The  usual  approach  is  to  use  a 
global  dt  for  all  particles.  In  order  to  avoid  gross  inaccuracies  at  very  small  distances,  the  minimum 
characteristic  time  over  all  particles  must  be  used  for  dt.  The  characteristic  time  of  an  object  is  a  measure  of 
how  long  it  takes  for  that  object’s  acceleration  to  change  significantly;  the  time  will  be  much  shorter  for  a 
particle  tightly  orbiting  a  neighbor.  The  occasional  tight  pairs  and  threesomes  require  an  expensively  small  j 

value  for  dt  in  the  naive  algorithm.  s 
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*  igu«  4*4:  Scattcrplot  of  components  of 

actual  vs.  computed  accelerations  for  fi  =  0.3 

The  characteristic  time  for  a  clump  C  is  the  time  in  which  a  child  of  C  will  move  a  distance  of 
approximately  5  times  the  child’s  distance  from  Cs  center  of  mass.  This  is  easy  to  calculate,  since  the  position 
vector  of  each  is  stored  as  the  vector  from  (the  center  of  mass  of)  C.  So  the  characteristic  time  of  C  is  the  min 
over  both  children  of  fv  and  t#  where 

$*1^1  =  fyX|  V\ 

(Note  that  P,  V \  and  A  arc  the  position,  velocity,  and  acceleration  vectors  of  the  children  relative  to  the  center 
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of  mass  of  C.)  In  each  iteration,  the  accelerations  arc  computed  by  ComputcAccel,  the  minimum 
characteristic  time  dl  is  found,  and  then  Move  calculates  the  new  positions  and  velocities.  Calculating  the 
minimum  characteristic  time  over  the  entire  universe  leads  to  an  exceedingly  small  dt ,  however.  Suppose  two 
or  three  galaxies  get  into  a  tight  orbit  around  each  other;  their  characteristic  time  may  be  an  order  of 
magnitude  shorter  than  the  characteristic  time  of  any  other  object  in  the  universe. 

It  would  be  nice  to  be  able  to  iterate  small,  very  tight  clusters  at  shorter  time  intervals  than  die  rest  of  die 
universe,  saving  a  large  amount  of  calculation.  This  is  not  too  difficult:  what  is  needed  is  a  concise  criterion  to 
distinguish  such  clumps. 

5.2.  Indivisible  Clumps 

Let  such  a  clump  be  considered  to  be  one  object,  indivisible,  of  nonzero  radius.  Indivisibility  will  be 
defined  as  follows;  a  clump  is  indivisible  if  for  all  clumps  outside  it,  its  ratio  of  size  to  distance  is  less  than  5. 
What  indivisibility  effectively  means  is  that  an  indivisible  clump  will  never  -  throughout  the  course  of  the 
acceleration  calculations  for  one  iteration  --  be  "split"  by  procedure  TwoNodc  to  calculate  accelerations  of  its 
subclumps  with  respect  to  any  other  clump.  This  is  easy  to  detect  --  simply  mark  clump  A  in  die  first  then 
clause  or  clump  B  in  tire  second  then  clause  of  procedure  TwoNodc.  Any  clump  that  is  never  marked  during 
tire  process  of  computing  all  the  accelerations  is  indivisible. 

The  reason  that  this  criterion  is  chosen  is  that  it  characterizes  very  well  the  set  of  clumps  such  that  the 
external  gravitational  field  acting  upon  them  is  an  almost  constant  function  of  position  within  the  clump.  In 
fact,  the  monopolc  approximation  has  the  effect  of  assuming  that  this  field  is  constant,  and  the  improved 
moving  algorithm  described  below  takes  advantage  of  this  fact. 

Procedure  Move,  procedure  ComputcAccel,  and  the  procedure  that  determines  dl  will  be  altered  so  that 
they  never  look  at  the  internal  structure  of  such  a  clump.  Note  that  TwoNodc  need  not  be  altered,  since  the 
way  indivisible  clumps  arc  defined  implies  that  TwoNodc  never  looks  at  their  internal  structure.  Now  the 
problem  is  gone:  the  small,  tight  duster  of  galaxies  has  become  a  point  (although  with  nonzero  radius).  The 
time  increment  dt  will  be  much  larger  than  it  could  have  been  otherwise. 

The  internal  motions  and  accelerations  of  these  tight  clumps  will  have  to  be  computed  every  iteration,  and 
in  fact  it  will  take  several  local  iterations  of  the  tight  clump  to  compute  its  motion  over  the  time  interval  dl. 
However,  these  iterations  of  three  or  four  objects  are  replacing  iterations  over  the  entire  universe. 
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5.3.  Closed  Form  Calculations 

When  an  indivisible  object  itself  is  a  clump  containing  two  indiv  isiblc  subclumps  (these  will  usually  be 
simply  individual  galaxies),  then  its  orbit  may  be  solved  in  closed  form.  In  this  ease,  the  calculations  to 
resolve  internal  moti-.  may  be  postponed  until  another  clump  gets  near  enough  to  see  the  internal  structure 
of  the  object.  ITiis  may  be  many  iterations  of  the  universe  later  •-  and  many  times  more  iterations  of  the  tight 
pair,  which  typically  has  a  much  shorter  characteristic  time.  Only  one  calculation  needs  to  be  made  in  closed 
form  to  replace  these  many  iterations:  furthermore,  this  calculation  will  be  exceedingly  accurate,  since  no 
approximations  arc  being  made  internally  to  the  subsystem. 

Since  indivisibility  may  occur  at  several  distance  scales  (indivisible  clumps  may  contain  clumps  which 
themselves  contain  indivisible  clumps,  and  so  on)),  the  tight-clump  calculations  (of  which  the  two-body  closed 
form  calculation  is  a  special  ease)  may  done  recursively. 

6.  Managing  the  Data  Structure 

The  efficiency  of  all  parts  of  the  algorithm  depends  on  having  the  structure  of  die  tree  of  clumps  accurately 
reflect  the  structure  of  the  particles  in  the  simulated  space.  Under  the  influence  of  gravity,  the  particles  move, 
distorting  die  tree.  The  structure  must  be  maintained  and  the  distortions  removed  regularly.  Fortunately,  this 
can  be  done  in  a  simple  way. 

6.1 .  Reorganizing  the  Tree 

After  moving  clumps  that  arc  not  indivisible,  the  coordinates  of  a  clump  will  no  longer  correspond  exactly 
to  the  center  of  mass  of  the  two  subclumps.  This  is  due  to  a  nearby  object  attracting  one  subclump  more 
strongly  than  the  other.  It  is  a  simple  matter,  however,  to  adjust  the  position  of  each  clump  after  its 
subclumps  have  been  moved.  Sometimes,  however,  another  subclump  will  intrude  into  a  clump  so  that  the 
clumps  no  longer  represent  disjoint  (in  the  simulated  thrcc-spacc)  clusters.  In  this  ease,  it  is  necessary  that  the 
clumps  be  rearranged  (while  keeping  the  actual  galaxies  fixed).  The  condition  to  aim  for  is  this:  for  every 
clump  C,  the  closest  clump  to  C  external  to  C  shall  be  its  parent  clump.  Let  Closcst(C)  be  the  nearest  clump 
with  which  C  is  compared  during  the  execution  of  procedure  TwoNodc.  If  the  distance  from  C  to  Closest C) 
is  less  than  the  distance  from  C  to  its  parent,  then  a  new  clump  IV  will  be  formed,  which  will  become  the 
subclump  of  Parcnt(C)  in  place  of  C.  IV  will  contain  as  subclumps  C  and  Closcst(C).  Now  the  old  parent 
clump  of  Goscst(C)  has  only  one  subclump,  so  it  can  be  liquidated,  "promoting"  its  subclump.  This  process 
is  represented  in  Figure  6-1. 

These  adjustments  (which  shall  be  known  as  Grabs)  take  place  immediately  after  procedure  ComputeAcccl 
finishes  running.  Each  Grab  is  a  purely  local  phenomenon  in  the  data  structure  (only  affecting  four  nodesX 


16 


ANDREW  W.  APPEL 


13  APRIL  1083 


Figure  6-1:  Rearrangement  of  Gumps 

and  preserves  the  positions,  velocities,  accelerations,  and  all  other  important  data  of  the  clumps  involved.  The 
process  of  Grabbing  guarantees  that  close  pairs  will  be  subclumps  of  the  same  clump,  and  that  the  clumps  will 
be  close  to  optimally  arranged  for  quickly  computing  accelerations.  Although  the  Grab  algorithm  docs  not 
find  the  "best"  arrangement  of  clumps,  it  has  been  observed  to  do  a  fairly  good  job  in  a  very  short  time  (see 
Figure  6-2). 

The  TwoNodc  procedure  that  calculates  tile  accelerations  throughout  the  tree  also  stores  information  which 
is  used  by  the  rearrangement  algorithm  in  finding  candidates  for  Grabs.  The  rearrangement  is  done  after 
every  iteration:  it  takes  time  linear  in  the  number  of  particles. 

6.2.  Creating  the  Tree 

While  grabbing  is  very  useful  in  maintaining  the  clump  structure  in  the  face  of  distortions,  it  will  not  be 
able  to  create  one  in  the  first  place  from  a  randomly  arranged  set  of  galaxies.  This  will  be  done  as  follows. 
The  universal  clump  --  which  contains  all  the  galaxies  --  will  be  divided  initially  into  two  subclumps  chosen  so 
that  the  first  contains  all  galaxies  whose  x  coordinate  is  less  than  the  median  x  coordinate,  and  the  other 
subctuinp  will  contain  all  galaxies  with  x  larger  than  or  equal  to  the  median  x. 

Each  of  those  subclumps  will  be  divided  into  two  sub-subclumps  using  the  median  y  as  the  splitting 
criterion.  Each  lower  level  of  clump  will  be  split  on  z,  then  x,  then  y,  then  z,  until  the  clumps  consist  of  one 
galaxy  each.  Note  that  this  procedure  docs  not  require  that  the  number  of  clumps  be  a  power  of  two, 
although  that  might  seem  most  natural. 

This  structure  is  known  as  a  *-d  tree  {4],  It  has  a  variety  of  applications  in  multidimensional  problems, 
including  searching,  nearest- neighbor  calculations,  classification,  numerical  integration,  and  computing 


Figure  6-2:  F.ffcct  of  the  Grab  Algorithm 

Figure  6-2  illustrates  the  effect  of  the  grab  algoriihm  on  a  two-dimensional  universe.  The  diagram  on  the  left  depicts  the  clump 
structure  as  first  created,  by  alternately  splitting  at  the  median  x  and  y.  rhe  diagram  on  ihc  right  shows  the  structure  after  several 
iterations  of  Grab.  Note  that  the  panicles  are  in  the  same  positions,  but  the  structure  is  cleaner  -  close  pairs  arc  now  all  linked  directly 
together.  This  improved  structure  may  be  measured  by  the  fact  lhal  the  acceleration  calculation  on  the  improved  structure  is  empirically 
observed  to  be  about  twice  as  efficient  as  on  the  original  structure. 

minimum  spanning  trees  [8, 5, 9].  For  a  many-body  application,  a  standard  k-d  tree  will  be  far  from  optimal 
--  nearby  objects  will  not  be  in  the  same  clump  much  of  the  time.  The  Grab  procedure,  though  its  behavior  is 
difficult  to  analyze  theoretically,  has  been  observed  to  do  a  very  good  iob  of  cleaning  up  the  structure  in  just 
two  or  three  iterations  (sec  Figure  6-2). 

7.  Implementation  of  the  Program 

Various  algorithmic  attacks  using  the  ccntcr-of-mass  tree  structure  have  been  described  in  the  preceding 
sections.  It  is  inappropriate  to  stop  seeking  reductions  in  running  time  after  a  good  algorithm  has  been  found, 
however;  significant  efficiencies  can  be  achieved  in  the  implementation  of  a  given  algorithm. 

The  algorithm  as  described  was  first  implemented  in  about  1200  lines  of  Pascal  on  a  VAX-11/780.  For  a 
problem  size  of  10,000  galaxies,  this  first  implementation  runs  in  about  forty  minutes  per  iteration,  and  about 
500  iterations  are  required  to  simulate  the  expansion  of  the  universe  by  a  factor  of  100.  Under  the  General 
Relativistic  assumptions  made,  letting  t  run  from  1  to  1000  causes  the  distance  scales  to  run  fh>m  1  to  100, 
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because  distance  is  proportional  to  t2/\  Accelerations  arc  transformed  at  each  iteration  to  correspond  with  the 
changing  disuince  scale.  (3j  Thus,  340  hours  of  execution  time  would  be  needed  for  this  program,  as  opposed 
to  8000  for  die  0(jV-  )  algoridim.  The  dmes  given  throughout  this  paper  are  for  a  slight  modification  of  the 
algorithm  to  simulate  a  periodic  distance  function,  which  was  necessary  in  the  initial  application.  This  adds  a 
small  constant  factor  to  all  distance  calculations  (eighteen  floating  point  instructions,  or  about  25%  of  die 
running  time). 

A  profiler  was  used  to  identify  those  parts  of  the  program  diat  consumed  most  of  the  processing  time  [19]. 
The  profiler  operates  by  asynchronously  sampling  the  computer's  program  counter  60  times  per  second  and 
incrementing  the  appropriate  bin  of  a  distribution  function.  'Hie  results  showed  that  all  but  two  percent  of 
the  execution  time  was  spent  in  the  TwoNodc  procedure.  This  is  not  unexpected,  as  TwoNodc  is  the  only 
part  of  the  algorithm  with  an  order  time  of  0(N  log  N):  the  rest  of  die  procedures  run  in  O(N)  time.  Since 
TwoNodc  is  relatively  small,  hand  optimization  of  die  machine  code  was  an  obvious  step.  Writing  this 
procedure  in  assembly  language  resulted  in  a  speedup  by  a  factor  of  two  and  a  half.  This  rewriting  used 
standard  techniques,  such  as  keeping  more  quantities  in  registers,  putting  procedure  calls  in-line,  and  using 
the  addressing  modes  of  the  VAX  more  effectively. 

At  this  point  we  found  that  the  us'c  of  the  Floating  Point  Accelerator  option  on  the  VAX  significantly 
improves  the  performance  of  the  program.  The  program  was  sped  up  by  a  factor  of  two  by  moving  the 
calculations  to  a  VAX  on  which  an  Accelerator  had  been  installed. 

Many-body  calculations  usually  require  double  precision  arithmetic  because  of  die  wide  range  of  distances 
involved.  Close  orbits  arc  often  more  than  four  orders  of  magnitude  --  a  dozen  binary  digits  --  closer  than  the 
distance  to  a  far-away  galaxy.  Since  the  improved  algorithm  stores  all  positions  relative  to  the  parent  clump, 
this  problem  disappears  --  typically  only  one  order  of  magnitude,  or  less,  is  involved  in  the  difference  between 
the  size  of  a  clump  and  the  size  of  its  parent  clump.  The  use  of  32-bit  floating  point  numbers  in  place  of 
64-bit  floating  point  halved  the  running  time  of  the  algoridim. 

The  factors  of  two  in  speed  from  the  use  of  the  Floating  Point  Accelerator  and  from  the  use  of  single 
precision  arc  approximate  and  interdependent.  Table  7-1  shows  the  running  time  as  a  function  of  these 
variables. 

Since  tight,  "indivisible"  clumps  are  recognized  and  their  small  time  constant  docs  not  affect  the  time 
constant  of  the  universe,  far  fewer  iterations  are  required.  Typically,  such  clumps  arc  iterated  about  four 
times  for  each  iteration  of  the  universal  clump.  Each  of  those  iterations  would  have  been  a  global  iteration  in 
the  straightforward  algorithm,  as  in  that  algorithm  there  is  no  way  to  detect  a  tight  clump.  A  conservative 
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32-bit  Moating  Point  64-bit  Moating  Point 
With  KPA  Hardware  16  28 

Without  I'PA  Hardware  25  74 

Table  7-1:  Running  times,  in  seconds,  of  an  acceleration  calculation 
for  1,000  bodies  on  a  VAX- 1 1/780. 

estimate  of  the  number  of  iterations  saved  is  50%  --  a  factor  of  two  speedup. 

In  the  astrophysical  applications  described  in  Section  2.  in  which  galaxy  clustering  occurs,  the  development 
of  clusters  among  the  particles  simulated  leads  to  greater  opportunities  for  procedure  TwoNodc  to  apply  its 
approximation.  The  resulting  gain  is  an  empirically  observed  two-fold  speedup  in  computation  of 
accelerations. 

The  program  that  resulted  from  these  modifications  to  a  very  simple  iteration  method  succeeded  in 
reducing  the  running  time  of  a  simulation  from  4000  hours  to  20  hours  --  a  factor  of  two  hundred  (for  ten 
thousand  bodies,  with  52=0.3).  This  saving  was  achieved  by  attacking  die  problem  from  several  angles  at 
once. 


8.  Conclusions 

It  is  often  very  difficult  to  make  one  change  in  a  program  that  makes  it  faster  by  more  than  one  order  of 
magnitude.  In  this  case,  even  a  change  that  reduced  the  order  time  of  the  algorithm  from  OfV3)  to 
0(i\  log  iV)  increased  the  efficiency  for  a  typical  problem  size  by  only  a  factor  of  12  --  one  o'Jcr  of 
magnitude.  The  four-hundred- fold  reduction  in  running  time  was  die  product  of  savings  at  all  levels  of  the 
conceptual  hierarchy,  from  the  idea  that  some  galaxies  arc  in  systems  by  themselves,  to  the  idea  dial  keeping 
certain  pointers  in  registers  saves  memory  references  (sec  Table  8-1).  They  arc  in  some  sense  independent 
--  improving  the  efficiency  of  one  level  of  the  hierarchy  docs  not  preclude  improving  the  efficiency  of 
another.  Most  importantly,  all  of  the  savings  arc  multiplied  together. 

Reddy  and  Newell  [17]  have  characterized  the  type  of  problem  for  which  diis  multiplicative  speedup  can  be 
expected:  such  a  problem  has  four  to  eight  layers  of  implementation,  such  as  computer  technology, 
architecture,  algorithm,  et  cetera.  This  paper  has  been  concerned  with  ways  to  avoid  changes  in  the 
technology  and  architecture  layers  (i.c.,  using  a  Cray-1)  because  of  their  expense.  Rather,  the  algorithms, 
"knowledge  sources,"  and  implementation  layers  have  been  attacked. 

This  brings  the  running  time  of  the  algorithm  on  a  relatively  smalt  and  inexpensive  computer  such  as  the 
VAX  down  to  what  it  would  be  on  a  large,  extremely  fast,  and  expensive  Cray-1.  Of  this  speedup,  a  factor  of 
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Level  Sneedur 

i  Factor 

Descriotion 

Algorithm 

12 

Changing  to  die  0(.V  log  .V)  algorithm 

Problem-Specific 

Knowledge 

2 

Iterating  indivisible  clumps  by  themselves  and  using 
closed-form  solutions,  thus  liaising  the  number  of  global  iterations. 

Algorithm 

(Problem-Specific) 

2 

Clustering  behavior  in  the  simulation  produces  a  clump 
structure  well-suited  to  the  algorithm 

System-Independent 

Code  Tuning 

2 

Use  of  single  precision  floating  point  rather  than  double 
precision,  made  possible  by  the  data  structure 

System- Dependent 

2.5 

Hand-coding  the  routine  where  most  of  the  time  was  spent 

Hardware 

2 

Use  of  the  Roating-Point  Accelerator 

Tabic  8-1:  Summary  of  the  speedups  attained  at  various  levels 

about  two  was  attributable  to  technology  (the  use  of  die  Routing  Point  Accelerator)  and  two  to 
implementation  (hand  coding  a  critical  routine)  --  these  could  be  done  for  any  program,  probably  with  similar 
results.  The  other  factor  of  a  hundred  (for  ten  thousand  bodies)  came  from  the  exploitation  of  die  data 
structure  in  various  ways.  The  use  of  a  good  data  structure  to  provide  an  asymptotically  fast  algorithm  is 
especially  important  for  large  problems. 

Since  the  layers  of  the  problem  are  rcladvcly  independent,  die  technology  and  architecture  layers  arc  still 
available  for  additional  speedup  factors.  If  the  program  were  run  on  a  Cray- 1  or  a  Cyber  7600,  the  20  hours 
of  runtime  might  be  reduced  to  1  or  2  hours,  since  most  of  the  efficiency  improvements  described  in  this 
paper  arc  machine-independent,  and  these  computers  arc  much  faster  than  the  VAX  (and  almost 
proportionally  more  expensive). 

The  data  structure  is  a  variant  of  one  already  known  in  the  literature  (the  k-d  tree),  but  the  reorganization 
of  the  tree  with  the  Grab  procedure  changes  it  substantially  --  it  loses  the  useful  (for  some  applications) 
property  of  being  split  along  planes  of  constant  x.y,  and  r,  and  gains  the  useful  (for  this  application)  property 
of  joining  mutually  nearest  neighbors  at  all  levels  of  the  hierarchy.  For  the  simulation  of  gravitational 
attractions,  this  turned  out  to  better  than  halve  the  number  of  calculations.  Reorganized  trees  may  have  other 
applications  as  well;  for  example,  the  recognition  of  individual  objects  from  the  point  set  obtained  from  a 
television  camera  might  be  facilitated  by  an  algorithm  that  could  group  points  together  in  0(N  log  N)  time. 
Some  sorts  of  nearest-neighbor  searching  might  also  be  made  easier. 


AN  IU  1  ICIllM  PROGRAM  I  OR  MANY  BODY  SIML'I A I  IONS  21 

It  is  difficult  tu  analyze  the  properties  of  the  Grab  algorithm.  It  is  low-level  in  nature:  when  two  points  arc 
found  to  be  closer  to  each  other  than  to  dieir  parent  nodes,  a  local  rearrangement  is  done  without  regard  for 
the  global  structure  of  the  tree.  That  it  works  as  well  as  it  does  was  difficult  to  predict.  Its  behavior  is 
dependent  on  5.  since  these  closest  pairs  arc  detected  during  the  TwoNodc  procedure:  the  question  of  what  8 
to  use  to  most  efficiently  produce  a  reorganized  tree  (independent  of  gravitational  considerations)  might  be 
investigated  if  reorganized  trees  arc  found  to  be  useful  in  other  applications. 
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